Abstract —Active authentication is the problem of continuously
verifying the identity of a person based on behavioral aspects 
of their interaction with a computing device. In this study, we 
collect and analyze behavioral biometrics data from 200 subjects, 
each using their personal Android mobile device for a period of 
at least 30 days. This dataset is novel in the context of active 
authentication due to its size, duration, number of modalities, 
and absence of restrictions on tracked activity. The geographical 
colocation of the subjects in the study is representative of a 
large closed-world environment such as an organization where 
the unauthorized user of a device is likely to be an insider threat: 
coming from within the organization. We consider four biometric 
modalities: (1) text entered via soft keyboard, (2) applications 
used, (3) websites visited, and (4) physical location of the device as 
determined from GPS (when outdoors) or WiFi (when indoors). 
We implement and test a classifier for each modality and organize 
the classifiers as a parallel binary decision fusion architecture. 
We are able to characterize the performance of the system 
with respect to intruder detection time and to quantify the 
contribution of each modality to the overall performance. 

Index Terms —Multimodal biometric systems, insider threat, intrusion
detection, behavioral biometrics, decision fusion, active
authentication, stylometry, GPS location, web browsing behavior, 
application usage patterns 

I. Introduction 

According to a 2013 Pew Internet Project study of 2076 people 
[1], 91% of American adults own a cellphone. Increasingly, 
people are using their phones to access and store sensitive 
data. The same study found that 81% of cellphone owners use 
their mobile device for texting, 52% use it for email, 49% 
use it for maps (enabling location services), and 29% use it 
for online banking. And yet, securing the data is often not 
taken seriously because of an inaccurate estimation of risk as 
discussed in [2]. In particular, several studies have shown that
a large percentage of smartphone owners do not lock their 
phone: 57% in [3], 33% in [4], 39% in [2], and 48% in this 
study. 

Active authentication is an approach of monitoring the behavioral
biometric characteristics of a user’s interaction with the
device for the purpose of securing the phone when the point-of-entry
locking mechanism fails or is absent. In recent years,
continuous authentication has been explored extensively on 
desktop computers, based either on a single biometric modality 
like mouse movement [5] or a fusion of multiple modalities 
like keyboard dynamics, mouse movement, web browsing, 
and stylometry [6]. Unlike physical biometric devices like 
fingerprint scanners or iris scanners, these systems rely on 
computer interface hardware like the keyboard and mouse that 
are already commonly available with most computers. 

In this paper, we consider the problem of active authentication 
on mobile devices, where the variety of available sensor data 
is much greater than on the desktop, but so is the variety of 
behavioral profiles, device form factors, and environments in 
which the device is used. Active authentication is the approach 
of verifying a user’s identity continuously based on various 
sensors commonly available on the device. We study four representative
modalities of stylometry (text analysis), application
usage patterns, web browsing behavior, and physical location
of the device. These modalities were chosen, in part, due
to their relatively low power consumption. In the remainder
of the paper these four modalities will be referred to as
TEXT, APP, WEB, and LOCATION, respectively. We consider
the trade-off between intruder detection time and detection 
error as measured by false accept rate (FAR) and false reject 
rate (FRR). The analysis is performed on a dataset collected 
by the authors of 200 subjects using their personal Android 
mobile device for a period of at least 30 days. To the best 
of our knowledge, this dataset is the first of its kind studied 
in active authentication literature, due to its large size [7], the 
duration of tracked activity [8], and the absence of restrictions 
on usage patterns and on the form factor of the mobile device. 
The geographical colocation of the participants, in particular, 
makes the dataset a good representation of an environment 
such as a closed-world organization where the unauthorized 
user of a particular device will most likely come from inside 
the organization. 

We propose to use decision fusion in order to asynchronously 
integrate the four modalities and make serial authentication
decisions. While we consider here a specific set of binary 
classifiers, the strength of our decision-level approach is that 
additional classifiers can be added without having to change 
the basic fusion rule. Moreover, it is easy to evaluate the 
marginal improvement of any added classifier to the overall 
performance of the system. We evaluate the multimodal continuous
authentication system by characterizing the error rates
of local classifier decisions, fused global decisions, and the 
contribution of each local classifier to the fused decision. The 
novel aspects of our work include the scope of the dataset, 
the particular portfolio of behavioral biometrics in the context 
of mobile devices, and the extent of temporal performance 
analysis. 

The remainder of the paper is structured as follows. In §II, we
discuss the related work on multimodal biometric systems, 
active authentication on mobile devices, and each of the 
four behavioral biometrics considered in this paper. In §III, 
we discuss the 200 subject dataset that we collected and 
analyzed. In §IV, we discuss four biometric modalities, their 
associated classifiers, and the decision fusion architecture. In 
§V, we present the performance of each individual classifier, 
the performance of the fusion system, and the contribution of 
each individual classifier to the fused decisions. 

II. Related Work 

A. Multimodal Biometric Systems 

The window of time based on which an active authentication 
system is tasked with making a binary decision is relatively 
short and thus contains a highly variable set of biometric 
information. Depending on the task the user is engaged in, 
some of the biometric classifiers may provide more data 
than others. For example, as the user chats with a friend 
via SMS, the text-based classifiers will be actively flooded 
with data, while the web browsing based classifiers may only 
get a few infrequent events. This motivates the recent work 
on multimodal authentication systems where the decisions of 
multiple classifiers are fused together [9]. In this way, the 
verification process is more robust to the dynamic nature 
of human-computer interaction. The current approaches to 
the fusion of classifiers center around max, min, median, or 
majority vote combinations [10]. When neural networks are 
used as classifiers, an ensemble of classifiers is constructed and 
fused based on different initialization of the neural network 
[11].

Several active authentication studies have utilized multimodal
biometric systems but, to the best of our knowledge, all of them:
(1) considered a smaller pool of subjects, (2) did not characterize
the temporal performance of intruder detection, and
(3) showed significantly worse overall performance than
that achieved in our study.

Our approach in this paper is to apply the Chair-Varshney 
optimal fusion rule [12] for the combination of available 
multimodal decisions. The strength of the decision-level fusion
approach is that an arbitrary number of classifiers can be
added without re-training the classifiers already in the system. 
This modular design allows for multiple groups to contribute 
drastically different classification schemes, each lowering the 
error rate of the global decision. 

B. Mobile Active Authentication 

With the rise of smartphone usage, active authentication on 
mobile devices has begun to be studied in the last few years. 
The large number of available sensors makes for a rich feature 
space to explore. Ultimately, the question is the one that we ask 
in this paper: what modality contributes the most to a decision 
fusion system toward the goal of fast, accurate verification 
of identity? Most of the studies focus on a single modality. 
For example, gait pattern was considered in [7] achieving 
an EER of 0.201 (20.1%) for 51 subjects during two short 
sessions, where each subject was tasked with walking down a 
hallway. Some studies have incorporated multiple modalities. 
For example, keystroke dynamics, stylometry, and behavioral 
profiling were considered in [13] achieving an EER of 0.033 
(3.3%) from 30 simulated users. The data for these users 
was pieced together from different datasets. To the best of 
our knowledge, the dataset that we collected and analyzed is 
unique in all its key aspects: its size (200 subjects), its duration 
(30+ days), and the size of the portfolio of modalities that were
all tracked concurrently with a synchronized timestamp. 

C. Stylometry, Web Browsing, Application Usage, Location 

Stylometry is the study of linguistic style. It has been extensively
applied to the problems of authorship attribution,
identification, and verification. See [14] for a thorough summary
of stylometric studies in each of these three problem
domains along with their study parameters and the resulting 
accuracy. These studies traditionally use large sets of features 
(see Table II in [15]) in combination with support vector 
machines (SVMs) that have proven to be effective in high 
dimensional feature space [16], even in cases when the number 
of features exceeds the number of samples. Nevertheless, with 
these approaches, often more than 500 words are required 
in order to achieve adequately low error rates [17]. This 
makes them impractical for the application of real-time active 
authentication on mobile devices where text data comes in 
short bursts. While the other three modalities are not well
investigated in the context of active authentication, this is not
true for stylometry. Therefore, for this modality, we do not reinvent
the wheel and instead implement the n-gram analysis approach
presented in [14] that has been shown to work sufficiently well
on short blocks of text.

Web browsing, application usage, and location have not been 
studied extensively in the context of active authentication. The 
following is a discussion of the few studies that we are aware 
of. Web browsing behavior has been studied for the purpose 
of understanding user behavior, habits, and interests [18]. 
Web browsing as a source for behavioral biometric data was 
considered in [19] to achieve average identification FAR/FRR
of 0.24 (24%) on a dataset of 14 desktop computer users. 
Application usage was considered in [8], where cellphone 
data (from 2004) from the MIT Reality Mining project [20] 
was used to achieve 0.1 (10%) EER based on a portfolio of 
metrics including application usage, call patterns, and location. 
Application usage and movement patterns have been studied
as part of behavioral profiling in cellular networks [8], [21], 
[22]. However, these approaches use position data of lower 
resolution in time and space than that provided by GPS on 
smartphones. To the best of our knowledge, GPS traces have 
not been utilized in literature for continuous authentication. 

III. Dataset 

The dataset used in this work contains behavioral biometrics 
data for 200 subjects. The collection of the data was carried 
out by the authors over a period of 5 months. The requirements 
of the study were that each subject was a student or employee 
of Drexel University and was an owner and an active user 
of an Android smartphone or tablet. The number of subjects 
with each major Android version and associated API level are 
listed in Table I. Nexus 5 was the most popular device with 
10 subjects using it. Samsung Galaxy S5 was the second most 
popular device with 6 subjects using it. 


Android Version    API Level    Subjects
4.4                19           143
4.1                16           16
4.3                18           15
4.2                17           9
4.0.4              15           5
2.3.6              10           4
4.0.3              15           3
2.3.5              10           3
2.2                8            2


TABLE I: The Android version and API level of the 200 
devices that were part of the study. 

A tracking application was installed on each subject’s device 
and operated for a period of at least 30 days until the subject 
came in to approve the collected data and get the tracking 
application uninstalled from their device. The following data 
modalities were tracked with 1-second resolution: 

• Text typed via soft keyboard. 

• Apps visited. 

• Websites visited. 

• Location (based on GPS or WiFi).

The key characteristics of this dataset are its large size (200 
users), the duration of tracked activity (30+ days), and the
geographical colocation of its participants in the Philadelphia 
area. Moreover, we did not place any restrictions on usage 
patterns, on the type of Android device, and on the Android 
OS version (see Table I). 

There were several challenges encountered in the collection 
of the data. The biggest problem was battery drain. Due to
the long duration of the study, we could not enable modalities
whose tracking proved to be significantly draining of battery 
power. These modalities include front-facing video for eye 
tracking and face recognition, gyroscope, accelerometer, and 
touch gestures. Moreover, we had to reduce GPS sampling 
frequency to once per minute on most of the devices. 


Event       Frequency
Text        23,254,478
App         927,433
Web         210,322
Location    143,875


TABLE II: The number of events in the dataset associated with 
each of the four modalities considered in this paper. A TEXT 
event refers to a single character entered on the soft keyboard. 
An APP event refers to a new app receiving focus. A WEB
event refers to a new url entered in the url box. A LOCATION
event refers to a new sample of the device location either from
GPS or WiFi.

Table II shows statistics on each of the four investigated 
modalities in the corpus. The table contains data aggregated 
over all 200 users. The “frequency” here is a count of the 
number of instances of an action associated with that modality. 
As stated previously, the four modalities will be referred to as
TEXT, APP, WEB, and LOCATION. For TEXT, the action is a
single keystroke on the soft keyboard. For APP, the action is
opening or bringing focus to a new app. For WEB, the action
is visiting a new website. For LOCATION, no explicit action
is taken by the user. Rather, location is sampled regularly 
at intervals of 1 minute when GPS is enabled. As Table II 
suggests, TEXT events fire 1-2 orders of magnitude more 
frequently than the other three. 

Fig. 1: The duration of time (in hours) that each of the 200
users actively interacted with their device. Users are ordered
from least to most active.

The data for each user is processed to remove idle periods
when the device is not active. The threshold for what is
considered an idle period is 5 minutes. For example, if the
time between event A and event B is 20 minutes, with no other
events in between, these 20 minutes are compressed down to 5
minutes. The date and time of the event are not changed but the
timestamp used in dividing the dataset for training and testing
(see §V-A) is updated to reflect the new time between event
A and event B. This compression of idle times is performed
in order to regularize periods of activity for cross validation
that utilizes time-based windows as described in §V-A. The
resulting compressed timestamps are referred to as “active
interaction”. Fig. 1 shows the duration (in hours) of active
interaction for each of the 200 users ordered from least to
most active.
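
The idle-time compression described above can be illustrated with a minimal sketch. It assumes that event timestamps are given in seconds and sorted in increasing order, and that the compressed clock starts at zero at the first event; the function name `compress_idle` is hypothetical and is not taken from the authors' implementation.

```python
from typing import List

# Idle threshold stated in the text: gaps longer than 5 minutes are capped.
IDLE_THRESHOLD_S = 5 * 60


def compress_idle(timestamps: List[float],
                  threshold: float = IDLE_THRESHOLD_S) -> List[float]:
    """Map sorted event timestamps (seconds) to "active interaction" time.

    Every gap between consecutive events that exceeds `threshold` is
    shortened to exactly `threshold`; shorter gaps are kept unchanged.
    Assumption: the compressed clock starts at 0 at the first event.
    """
    if not timestamps:
        return []
    compressed = [0.0]
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        compressed.append(compressed[-1] + min(gap, threshold))
    return compressed


# Event B arrives 20 minutes after event A; the gap is compressed to 5 minutes.
events = [0, 60, 60 + 20 * 60]
active = compress_idle(events)
```

In this example the third event, which occurs 20 minutes after the second, is placed 5 minutes after it on the compressed clock, matching the example in the text.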

Table III shows three top-20 lists: (1) the top-20 apps based on
the amount of text that was typed inside each app, (2) the top-20
apps based on the number of times they received focus,
and (3) the top-20 website domains based on the number of
times a website associated with that domain was visited. These
are aggregate measures across the dataset intended to provide
an intuition about its structure and content, but the top-20 list
is the same as that used for the classifier model based on
the WEB and APP features in §IV.



Fig. 2: An aggregate heatmap showing a selection from the 
dataset of GPS locations in the Philadelphia area. 


Fig. 2 shows a heat map visualization of a selection from the 
dataset of GPS locations in the Philadelphia area. The subjects 
in the study resided in Philadelphia but traveled all over
the United States and the world. There are two key characteristics
of the GPS location data. First, it is relatively unique to
each individual even for people living in the same area of
a city. Second, outside of occasional travel, it does not vary
significantly from day to day. Human beings are creatures of
habit, and inasmuch as location is a measure of habit, this
idea is confirmed by the location data of the majority of the 
subjects in the study. 

IV. Classification and Decision Fusion 
A. Features and Classifiers 

The four distinct biometric modalities considered in our analysis
are (1) text entered via soft keyboard, (2) applications
used, (3) websites visited, and (4) physical location of the 
device as determined from GPS (when outdoors) or WiFi 
(when indoors). We refer to these four modalities as TEXT, 
APP, WEB, and LOCATION, respectively. In this section we 
discuss the features that were extracted from the raw data of
each modality, and the classifiers that were used to map these
features into binary decision space. 

A binary classifier is constructed for each of the 200 users and
4 modalities. In total, there are 800 classifiers, each producing
either a probability that a user is valid, $P(H_1)$, or a binary
decision of 0 (invalid) or 1 (valid). The first class ($H_1$) for
each classifier is trained on the valid user’s data and the second
class ($H_0$) is trained on the other 199 users’ data. The training
process is described in more detail in §V-A. For APP, WEB,
and LOCATION, the classifier takes a single instance of the
event and produces a probability. For multiple events of the
same modality, the set of probabilities is fused across time
using maximum likelihood:

$$H^* = \arg\max_{H_i,\; i \in \{0,1\}} \prod_{x_t \in \Omega} P(x_t \mid H_i), \tag{1}$$

where $\Omega = \{x_t \mid T_{\text{current}} - T(x_t) \le \omega\}$, $\omega$ is a fixed window
size in seconds, $T(x_t)$ is the timestamp of event $x_t$, and
$T_{\text{current}}$ is the current timestamp. The process of fusing
classifier scores across time is illustrated in Fig. 3.
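
As an illustration of the temporal fusion in (1), the following sketch selects the hypothesis whose likelihood product over the current window is larger. It is not the authors' code. It assumes each event carries the two per-event likelihoods $P(x_t \mid H_1)$ and $P(x_t \mid H_0)$, sums logarithms instead of multiplying probabilities for numerical stability, clamps zero likelihoods to a small constant, and breaks ties in favor of $H_1$; all names are hypothetical.

```python
import math
from typing import Iterable, Optional, Tuple

EPS = 1e-12  # assumption: floor for likelihoods so that log() is defined

# One event: (timestamp in seconds, P(x_t | H_1), P(x_t | H_0))
Event = Tuple[float, float, float]


def fuse_over_window(events: Iterable[Event],
                     t_current: float,
                     window_s: float) -> Optional[int]:
    """Return 1 if H_1 maximizes the likelihood product in (1), else 0.

    Only events with 0 <= t_current - T(x_t) <= window_s are used.
    Returns None when no event falls inside the window.
    """
    log_l1 = 0.0
    log_l0 = 0.0
    used = 0
    for ts, p1, p0 in events:
        if 0 <= t_current - ts <= window_s:
            log_l1 += math.log(max(p1, EPS))
            log_l0 += math.log(max(p0, EPS))
            used += 1
    if used == 0:
        return None
    return 1 if log_l1 >= log_l0 else 0


# Hypothetical likelihoods for three events of one modality.
sample = [(100.0, 0.8, 0.2), (130.0, 0.3, 0.7), (10.0, 0.01, 0.99)]
decision_short = fuse_over_window(sample, t_current=150.0, window_s=60.0)
decision_long = fuse_over_window(sample, t_current=150.0, window_s=200.0)
```

With the 60-second window only the two recent events contribute (products 0.24 versus 0.14), so the valid hypothesis wins; widening the window to 200 seconds admits the older event and reverses the decision.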

1) Text: As Table IIIa indicates, the apps into which text was
entered on mobile devices varied, but the activity in the majority
of the cases was communication via SMS, MMS, WhatsApp,
Facebook, Google Hangouts, and other chat apps. Therefore,
TEXT events fired in short bursts. The tracking application captured
the keys that were touched on the keyboard and not the
autocorrected result. Therefore, the majority of the typed messages
had a lot of misspellings and words that were erased in
the final submitted message. In the case of SMS, we were also
able to record the submitted result. For example, an SMS text
that was submitted as “Sorry couldn't call back.”
had associated with it the following recorded keystrokes:
“Sprry coyld cpuldn't vsll back”. Classification
based on the actual typed keys is, in principle, a better representation
of the person’s linguistic style. It captures unique typing
idiosyncrasies that autocorrect can conceal. As discussed in
§II, we implemented a one-feature n-gram classifier from [14]
that has been shown to work well on short messages. It works
by analyzing the presence or absence of n-grams with respect
to the training set.
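
The idea of scoring a text block by the presence or absence of its n-grams in the training set can be sketched as follows. This is only a rough illustration and not the classifier of [14]: the use of character trigrams, the "fraction of n-grams seen in training" score, and the function names are all assumptions made for the example.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of distinct character n-grams in `text` (assumed feature)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def presence_score(block: str, profile: Set[str], n: int = 3) -> float:
    """Fraction of the block's distinct n-grams that appear in the profile.

    `profile` is the n-gram set built from a user's training text.
    Blocks shorter than n characters yield a score of 0.0.
    """
    grams = char_ngrams(block, n)
    if not grams:
        return 0.0
    return len(grams & profile) / len(grams)


# Profile built from the raw keystrokes quoted in the text.
profile = char_ngrams("sprry coyld cpuldn't vsll back")
score_seen = presence_score("sprry", profile)
score_unseen = presence_score("zzzzz", profile)
```

A block would then be attributed to the valid user when its score exceeds a threshold chosen on held-out data; the threshold itself is not specified here.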

2) App and Web: The APP and WEB classifier models we 
construct are identical in their structure. For the APP modality 
we use the app name as the unique identifier and count the 
number of times a user visits each app in the training set. 
For the WEB modality we use the domain of the URL as 
the unique identifier and count the number of times a user 
visits each domain in the training set. Note that, for example, 
“m.facebook.com” is considered a different domain than
“www.facebook.com” because the subdomain is different. In
this section we refer to the app name and the web domain
as an “entity”. Table IIIb and Table IIIc show the top entities
aggregated across all 200 users for APP and WEB respectively. 

(a)
App Name                          Keys Per App
com.android.sms                   5,617,297
com.android.mms                   5,552,079
com.whatsapp                      4,055,622
com.facebook.orca                 1,252,456
com.google.android.talk           1,147,295
com.infraware.polarisviewerd      990,319
com.android.chrome                417,165
com.facebook.katana               405,267
com.snapchat.android              377,840
com.google.android.gm             271,570
com.htc.sense.mms                 238,300
com.tencent.mm                    221,461
com.motorola.messaging            203,649
com.android.calculator2           167,435
com.verizon.messaging.vzmsgs      137,339
com.groupme.android               134,896
com.handcent.nextsms              123,065
com.jb.gosms                      118,316
com.sonyericsson.conversations    114,219
com.twitter.android               92,605

(b)
App Name         Visits
TouchWiz home    101,151
WhatsApp         64,038
Messaging        60,015
Launcher         39,113
Facebook         38,591
Google Search    32,947
Chrome           32,032
Snapchat         23,481
System UI        22,772
Phone            19,396
Gmail            19,329
Messages         19,154
Contacts         18,668
Hangouts         17,209
Home             16,775
HTC Sense        16,325
YouTube          14,552
Xperia Home      13,639
Instagram        13,146
Settings         12,675

(c)
Website Domain               Visits
www.google.com               19,004
m.facebook.com               9,300
www.reddit.com               4,348
forums.huaren.us             3,093
learn.dcollege.net           2,133
en.m.wikipedia.org           1,825
mail.drexel.edu              1,520
one.drexel.edu               1,472
login.drexel.edu             1,462
likes.com                    1,361
mail.google.com              1,292
i.imgur.com                  1,132
www.amazon.com               1,079
netcontrol.irt.drexel.edu    1,049
www.facebook.com             903
banner.drexel.edu            902
m.hupu.com                   824
t.co                         801
duapp2.drexel.edu            786
m.ign.com                    725

TABLE III: Top 20 apps ordered by text entry and visit frequency and top 20 websites ordered by visit frequency. These tables
are provided to give insight into the structure and content of the dataset.

For each user, the classification model for the valid class
is constructed by determining the top 20 entities visited by
that user in the training set. The quantity of visits is then
normalized so that the 20 frequency values sum to 1. The
classification model for the invalid class is constructed by
counting the number of visits by the other 199 users to those
same 20 entities, such that for each of those entities we now
have a probability that a valid user visits it and an invalid user
visits it. The evaluation for each user given the two empirical
distributions is performed by the maximum likelihood product
in (1). Entities that do not appear in the top 20 are considered
outliers and are ignored in this classifier.
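
A minimal sketch of how the two empirical distributions for the APP and WEB classifiers could be built is given below. It is not the authors' implementation. It assumes that the invalid-class counts are normalized over the same top-k entities in the same way as the valid-class counts, and it applies no smoothing, so an entity never visited by the other users receives probability zero; the function name `build_models` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_models(valid_events: Iterable[str],
                 other_events: Iterable[str],
                 top_k: int = 20) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Build P(entity | valid) and P(entity | invalid) over the top-k entities.

    valid_events: entities (app names or web domains) visited by the valid
                  user in the training set.
    other_events: entities visited by all other users in the training set.
    """
    counts = Counter(valid_events)
    top = [entity for entity, _ in counts.most_common(top_k)]

    # Valid class: normalize the top-k counts so that they sum to 1.
    total_valid = sum(counts[e] for e in top)
    p_valid = {e: counts[e] / total_valid for e in top}

    # Invalid class: count visits by the other users to the same entities.
    other_counts = Counter(e for e in other_events if e in p_valid)
    total_other = sum(other_counts.values())
    p_invalid = {
        e: (other_counts[e] / total_other if total_other else 0.0) for e in top
    }
    return p_valid, p_invalid


# Toy example with top_k=2: "c" and "d" fall outside the top entities.
p_valid, p_invalid = build_models(
    ["a", "a", "a", "b", "b", "c"],
    ["a", "b", "b", "b", "d"],
    top_k=2,
)
```

At test time, an event whose entity is missing from both dictionaries would simply be skipped, consistent with the treatment of outliers described above.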

3) Location: Location is specified as a pair of values: latitude 
and longitude. Classification is performed using support vector 
machines (SVMs) [23] with the radial basis function (RBF)
as the kernel function. The SVM produces a classification 
score for each pair of latitude and longitude. This score is 
calibrated to form a probability using Platt scaling [24] which 
requires an extra logistic regression on the SVM scores via 
an additional cross-validation on the training data. All of the 
code in this paper is written by the authors except for the SVM 
classifier. Since the authentication system is written in C++,
we used the Shark 3.0 machine learning library for the SVM 
implementation. 


B. Decision Fusion 

Decision fusion with distributed sensors is described by Tenney
and Sandell in [25] who studied a parallel decision
architecture. As described in [26], the system comprises
$n$ local detectors, each making a decision about a binary
hypothesis $(H_0, H_1)$, and a decision fusion center (DFC) that
uses these local decisions for a global decision
about the hypothesis. The $i$-th detector collects $K$ observations
before it makes its decision, $u_i$. The decision is $u_i = 1$ if the
detector decides in favor of $H_1$ and $u_i = -1$ if it decides
in favor of $H_0$. The DFC collects the $n$ decisions of the
local detectors and uses them in order to decide in favor of
$H_0$ ($u = -1$) or in favor of $H_1$ ($u = 1$). Tenney and Sandell
[25] and Reibman and Nolte [27] studied the design of the
local detectors and the DFC with respect to a Bayesian cost,
assuming the observations are independent conditioned on the
hypothesis. The ensuing formulation derived the local and
DFC decision rules to be used by the system components for
optimizing the system-wide cost. The resulting design requires
the use of likelihood ratio tests by the decision makers (local
detectors and DFC) in the system. However, the thresholds
used by these tests require the solution of a set of nonlinear
coupled differential equations. In other words, the designs of
the local decision makers and the DFC are co-dependent. In
most scenarios the resulting complexity renders the quest for
an optimal design impractical.

Chair and Varshney in [12] developed the optimal fusion rule
when the local detectors are fixed and local observations are
statistically independent conditioned on the hypothesis. The Data
Fusion Center is optimal given the performance characteristics
of the local fixed decision makers. The result is a suboptimal
(since local detectors are fixed) but computationally efficient
and scalable design. In this study we use the Chair-Varshney
formulation. The parallel distributed fusion scheme (see Fig. 3)
allows each classifier to observe an event, minimize the local
risk and make a local decision over the set of hypotheses,
based on only its own observations. Each classifier sends out
a decision of the form:

$$u_i = \begin{cases} 1, & \text{if } H_1 \text{ is decided} \\ -1, & \text{if } H_0 \text{ is decided} \end{cases} \tag{2}$$

Fig. 3: The fusion architecture across time and across classifiers. The TEXT, APP, WEB, and LOCATION boxes indicate a firing
of a single event associated with each of those modalities. Multiple classifier scores from the same modality are fused via (1)
to produce a single local binary decision. Local binary decisions from each of the four modalities are fused via (4) to produce
a single global binary decision.

The fusion center combines these local decisions by minimizing
the global Bayes’ risk. The optimum decision rule
performs the following likelihood ratio test

$$\frac{P(u_1, \ldots, u_n \mid H_1)}{P(u_1, \ldots, u_n \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{P_0}{P_1} = \tau, \tag{3}$$

where the a priori probabilities of the binary hypotheses $H_1$
and $H_0$ are $P_1$ and $P_0$ respectively. In this case the general
fusion rule proposed in [12] is


$$f(u_1, \ldots, u_n) = \begin{cases} 1, & \text{if } a_0 + \sum_{i=1}^{n} a_i u_i > 0 \\ -1, & \text{otherwise} \end{cases} \tag{4}$$

with $P_i^M$, $P_i^F$ representing the False Rejection Rate (FRR) and
False Acceptance Rate (FAR) of the $i$-th classifier respectively.
The optimum weights minimizing the global probability of
error are given by

$$a_0 = \log \frac{P_1}{P_0} \tag{5}$$

$$a_i = \begin{cases} \log \dfrac{1 - P_i^M}{P_i^F}, & \text{if } u_i = 1 \\[2ex] \log \dfrac{1 - P_i^F}{P_i^M}, & \text{if } u_i = -1 \end{cases} \tag{6}$$

The threshold in (3) requires knowledge of the a priori
probabilities of the hypotheses. In practice, these probabilities
are not available, and the threshold $\tau$ is determined using
different considerations such as fixing the probability of false
alarm or false rejection as is done in §V-C.

V. Results

A. Training, Characterization, Testing

The data of each of the 200 users’ active interaction with
the mobile device was divided into 5 equal-size folds (each
containing a 20% time span of the full set). We performed
training of each classifier on the first three folds (60%). We
then tested their performance on the fourth fold. This phase
is referred to as “characterization”, because its sole purpose
is to form estimates of FAR and FRR for use by the fusion
algorithm. We then tested the performance of the classifiers,
individually and as part of the fusion system, on the fifth fold.
This phase is referred to as “testing” since this is the part
that is used for evaluating the performance of the individual
classifiers and the fusion system. The three phases of training,
characterization, and testing as they relate to the data folds are
shown in Fig. 4.

• Training on folds 1, 2, 3.
Characterization on fold 4.
Testing on fold 5.

• Training on folds 2, 3, 4.
Characterization on fold 5.
Testing on fold 1.

• Training on folds 3, 4, 5.
Characterization on fold 1.
Testing on fold 2.

• Training on folds 4, 5, 1.
Characterization on fold 2.
Testing on fold 3.

• Training on folds 5, 1, 2.
Characterization on fold 3.
Testing on fold 4.
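
The five-fold rotation of training, characterization, and testing folds can be generated programmatically. The sketch below is only an illustration with hypothetical function names. `fold_of` additionally assumes that folds are cut at equal fractions of each user's total active-interaction time, which is how the "20% time span" statement is read here.

```python
from typing import List, Tuple


def fold_schedule(n_folds: int = 5,
                  n_train: int = 3) -> List[Tuple[List[int], int, int]]:
    """Return (training folds, characterization fold, testing fold) per experiment.

    Folds are numbered from 1 and the roles rotate cyclically.
    """
    schedule = []
    for start in range(n_folds):
        train = [(start + k) % n_folds + 1 for k in range(n_train)]
        characterization = (start + n_train) % n_folds + 1
        testing = (start + n_train + 1) % n_folds + 1
        schedule.append((train, characterization, testing))
    return schedule


def fold_of(t: float, total: float, n_folds: int = 5) -> int:
    """1-based fold index of an event at active-interaction time t in [0, total]."""
    return min(int(n_folds * t / total), n_folds - 1) + 1


experiments = fold_schedule()
```

Each entry of `experiments` corresponds to one of the five experiments, in the same order as the rotation given in the text.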


Fig. 4: The three phases of processing the data to determine the
individual performance of each classifier and the performance
of the fusion system that combines some subset of these
classifiers. For the accept class (the valid user) and for each user
in the reject class, 60% of the data is used for training, 20% for
characterization, and 20% for testing.
The common evaluation method used with each classifier for
data fusion was measuring the averaged error rates across five
experiments. In each experiment, data of 3 folds was taken
for training, 1 fold for characterization, and 1 for testing.
The FAR and FRR computed during characterization were
taken as input for the fusion system as a measurement of
the expected performance of the classifiers. Therefore each
experiment consisted of three phases: 1) train the classifier(s)
using the training set, 2) determine FAR and FRR based on
the characterization set, and 3) classify the windows in the test set.
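
To make the role of the characterization-phase error rates concrete, the following sketch evaluates the Chair-Varshney rule of §IV-B: each local decision $u_i \in \{-1, 1\}$ is weighted by a log-ratio built from that classifier's FRR and FAR, and the sign of the weighted sum gives the global decision. The code is an illustration under stated assumptions, not the authors' implementation: equal priors are used by default (so the prior term is zero), error rates must lie strictly between 0 and 1, and the numeric rates in the example are hypothetical.

```python
import math
from typing import Sequence


def cv_weight(u: int, frr: float, far: float) -> float:
    """Weight of one local decision, chosen according to its value u."""
    if not (0.0 < frr < 1.0 and 0.0 < far < 1.0):
        raise ValueError("FRR and FAR must lie strictly between 0 and 1")
    if u == 1:
        return math.log((1.0 - frr) / far)
    return math.log((1.0 - far) / frr)


def fuse_decisions(decisions: Sequence[int],
                   frrs: Sequence[float],
                   fars: Sequence[float],
                   p1: float = 0.5) -> int:
    """Global decision: 1 (valid user) or -1 (intruder).

    decisions: local decisions u_i in {-1, 1}
    frrs, fars: per-classifier FRR and FAR estimated on the characterization fold
    p1: assumed prior probability of H_1 (0.5 makes the prior term zero)
    """
    a0 = math.log(p1 / (1.0 - p1))
    total = a0 + sum(
        cv_weight(u, frr, far) * u
        for u, frr, far in zip(decisions, frrs, fars)
    )
    return 1 if total > 0 else -1


# Hypothetical rates: an accurate classifier says "valid",
# an inaccurate one says "intruder"; the accurate one dominates.
global_decision = fuse_decisions([1, -1], frrs=[0.05, 0.4], fars=[0.1, 0.4])
```

Because the weights grow as the error rates shrink, a classifier that was accurate during characterization has more influence on the fused decision than one that was not.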


B. Performance: Individual Classifiers 




Fig. 5: FAR and FRR performance of the individual classifiers
associated with each of the four modalities. Each bar represents
the average error rate for a given module and time window.
Each of the 200 users has 2 classifiers for each modality,
so each bar provides a value that was averaged over 200
individual error rates. The error bars indicate the standard
deviation across these 200 values.

The conflicting objectives of an active authentication system
are response time and performance. The less the system
waits before making an authentication decision, the higher 
the expected rate of error. As more behavioral biometric data 
trickles in, the system can, on average, make a classification 
decision with greater certainty. 

This pattern of decreased error rates with an increased decision
window can be observed in Fig. 5 that shows (for 10
different time windows) the FAR and FRR of the 4 classifiers
averaged over the 200 users with the error bars indicating 
the standard deviation. The “testing fold” (see §V-A) is used 
for computing these error rates. The “characterization fold” 
does not affect these results, but is used only for FAR/FRR 
estimation required by the decision fusion center in §V-C. 

The “time before decision” is the time between the first 
event indicating activity and the first decision produced by 
the fusion system. This metric can be thought of as “decision 
window size”. Events older than the time range covered by 
the time-window are disregarded in the classification. If no 
event associated with the modality under consideration fires 
in a specific time window, no error is added to the average. 
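
The windowing rule described above can be illustrated with a minimal sketch. The event representation (a list of `(timestamp_seconds, modality)` tuples) and the function names are assumptions made for illustration only, not the data format used in the study.

```python
def events_in_window(events, decision_time, window_seconds):
    """Keep events whose timestamp lies in (decision_time - window, decision_time].

    Events older than the window are disregarded, as described in the text.
    """
    start = decision_time - window_seconds
    return [(t, m) for (t, m) in events if start < t <= decision_time]


def mean_error(window_errors):
    """Average per-window errors, skipping windows in which no event fired.

    A window with no events is represented by None and adds nothing
    to the average.
    """
    scored = [e for e in window_errors if e is not None]
    return sum(scored) / len(scored) if scored else None


# Hypothetical event stream: (seconds since start, modality)
events = [(0, "text"), (50, "app"), (100, "text"), (400, "location")]
recent = events_in_window(events, decision_time=400, window_seconds=60)
```

With a 60-second window ending at t = 400, only the location event remains in `recent`; the earlier text and app events fall outside the window.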


Event       Firing Rate (per hour) 
Text        557.8 
App         23.2 
Web         5.6 
Location    3.5 


TABLE IV: The rates at which an event associated with 
each modality “fires” per hour. On average, GPS location is 
provided only 3.5 times an hour. 

There are two notable observations about the FAR/FRR plots 
in Fig. 5. First, the location modality provides the lowest 
error rates even though, on average across the dataset, it fires 
only 3.5 times an hour, as shown in Table IV. This means 
that classification on a single GPS coordinate is sufficient to 
correctly verify the user with an FAR of under 0.1 and an FRR 
of under 0.05. Second, the text modality converges to an FAR 
of 0.16 and an FRR of 0.11 after 30 minutes, which makes it one 
of the worst performers of the four modalities, even though 
it fires 557.8 times an hour on average. At the 30 minute 
mark, that firing rate equates to an average text block size of 
279 characters. An FAR/FRR of 0.16/0.11 with 279-character 
blocks improves on the error rates achieved in [14] with 500-character 
blocks, which in turn improved on the error rates 
achieved in prior work for blocks of small text (see [14] for 
a full reference list on short-text stylometric analysis). 
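
The 279-character figure follows directly from the firing rates in Table IV. The sketch below reproduces that arithmetic; it assumes events are spread evenly over time, which is a simplification, and the helper name `expected_events` is hypothetical.

```python
# Average firing rates (events per hour) taken from Table IV.
FIRING_RATE_PER_HOUR = {"text": 557.8, "app": 23.2, "web": 5.6, "location": 3.5}


def expected_events(rate_per_hour, window_minutes):
    """Average number of events in a window, assuming a constant rate."""
    return rate_per_hour * window_minutes / 60.0


text_block_size = expected_events(FIRING_RATE_PER_HOUR["text"], 30)
location_events = expected_events(FIRING_RATE_PER_HOUR["location"], 30)
print(round(text_block_size), location_events)
```

The same calculation gives fewer than two location events per 30 minute window on average, which is why the low location error rates are notable.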

C. Performance: Decision Fusion 

The events associated with each of the 4 modalities fire at 
very different rates as shown in Table IV. Moreover, text 
events fire in bursts, while the location events fire at regularly 
spaced intervals when GPS signal is available. The app and 
web events fire at varying degrees of burstiness depending 
on the user. Fig. 6 shows the distribution of the number of 
events that fire within each of the time windows. An important 
takeaway from these distributions is that most events come in 
bursts followed by periods of inactivity. This results in the 
counterintuitive fact that the 1 minute, 10 minute, and 30 
minute windows have a similar distribution of the number 
of events that fire within them. This is why the decrease in 
error rates attained from waiting longer for a decision is not 
as significant as might be expected. 

[Figure: histogram. Axis label: Number of Events Fired in Time Window] 

Fig. 6: The distribution of the number of events that fire within 
a given time window. This is a long-tail distribution, as non-zero 
probabilities of event frequencies above 13 extend to 
over 100. These outliers are excluded from this histogram plot 
in order to highlight the high-probability frequencies. Time 
windows in which no events fire are not included in this plot. 

Asynchronous fusion of classification of events from each of 
the four modalities is robust to the irregular rates at which 
events fire. The decision fusion rule in (4) utilizes all the 
available biometric data, weighing each classifier according 
to its prior performance. Fig. 7 shows the receiver operating 
characteristic (ROC) curve trading off between FAR and FRR 
by varying the threshold parameter r in (3). 
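
To illustrate what weighing each classifier by its prior performance can look like, here is a minimal sketch of a standard log-likelihood-ratio fusion of independent binary decisions. It is an assumption that the rule in (4) has this general form; the function name `fuse`, the prior, and the threshold of 0 are hypothetical. The FAR/FRR values are the approximate figures quoted in §V-B for the location and text classifiers.

```python
import math


def fuse(decisions, far, frr, prior_genuine=0.5):
    """Fused log-likelihood ratio from local accept (+1) / reject (-1) decisions.

    Only modalities present in `decisions` contribute, so the rule tolerates
    classifiers that did not fire in the current window (asynchronous fusion).
    A positive score favors the legitimate user.
    """
    score = math.log(prior_genuine / (1.0 - prior_genuine))
    for modality, u in decisions.items():
        if u == +1:
            # P(accept | genuine) / P(accept | impostor)
            score += math.log((1.0 - frr[modality]) / far[modality])
        else:
            # P(reject | genuine) / P(reject | impostor)
            score += math.log(frr[modality] / (1.0 - far[modality]))
    return score


far = {"location": 0.10, "text": 0.16}
frr = {"location": 0.05, "text": 0.11}

# Location accepts while text rejects: the more reliable classifier dominates.
score_accept = fuse({"location": +1, "text": -1}, far, frr)
score_reject = fuse({"location": -1, "text": +1}, far, frr)
```

In this sketch the sign of the fused score follows the location classifier in both cases, because its lower characterized error rates give it a larger weight. Sweeping the decision threshold instead of fixing it at 0 traces out a ROC curve.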
Fig. 7: The performance of the fusion system with 4 classifiers 
on the 200-subject dataset. The ROC curve shows the tradeoff 
between FAR and FRR achieved by varying the threshold 
parameter a_0 in (4). 


As the size of the decision window increases, the performance 
of the fusion system improves, dropping from an equal error 
rate (EER) of 0.05 using the 1 minute window to below 0.01 
EER using the 30 minute window. 
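
The EER is the operating point on the ROC curve where FAR and FRR are equal. As a minimal sketch of how it can be approximated from fused scores by sweeping a threshold, consider the following; the scores are made-up toy values, not data from the study, and the function names are hypothetical.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Scores at or above the threshold are accepted as the legitimate user."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


def approximate_eer(genuine_scores, impostor_scores):
    """Return the mean of FAR and FRR at the threshold where they are closest."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for threshold in candidates:
        far, frr = far_frr(genuine_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Toy fused scores (hypothetical): higher means "more likely the legitimate user".
genuine = [0.9, 0.8, 0.7, 0.6, 0.3]
impostor = [0.1, 0.2, 0.4, 0.5, 0.65]
eer = approximate_eer(genuine, impostor)
```

On finite data FAR and FRR rarely cross exactly, so the sketch reports their average at the closest threshold.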


D. Contribution of Local Classifiers to Global Decision 
Fig. 8: Relative contribution of each of the 4 classifiers 
computed according to (7). 


The performance of the fusion system that utilizes all four 
modalities of TEXT, APP, WEB, and LOCATION is described in 
the previous section. In addition, we are able to use the fusion 
system to characterize the contribution of each of the local 
classifiers to the global decision. This is the central question 
we consider in the paper: which biometric modality is most 
helpful in verifying a person’s identity under the constraint of 
a specific time window before the verification decision must 
be made? We measure the contribution C_i of each of the four 
classifiers by evaluating the performance of the system with 
and without the classifier, and computing the contribution by: 




$$C_i = \frac{E_i - E}{E_i} \tag{7}$$


where E is the error rate computed by averaging FAR and FRR 
of the fusion system using the full portfolio of 4 classifiers, 
E_i is the error rate of the fusion system using all but the 
i-th classifier, and C_i is the relative contribution of the i-th 
classifier as shown in Fig. 8. We consider the contribution 
of each classifier under three time windows of 1 minute, 10 
minutes, and 30 minutes. Location contributes the most in all 
three cases, with the second biggest contributor being web 
browsing. Text contributes the least for the small window of 
1 minute, but improves for the larger windows. App usage is 
the least predictable contributor. One explanation for the APP 
modality contributing significantly under the short decision 
window is that the first app opened in a session is a strong 
and frequent indicator of identity. 
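
The leave-one-out computation behind (7) can be sketched as follows, assuming (7) has the form C_i = (E_i - E) / E_i as described in the definitions above. The error values are hypothetical placeholders chosen for illustration, not results from the study, and the function name is an assumption.

```python
def relative_contribution(error_full, error_without):
    """Compute C_i = (E_i - E) / E_i for each classifier i.

    error_full:    E, the averaged FAR/FRR of the fusion of all classifiers.
    error_without: mapping i -> E_i, the error of the fusion without classifier i.
    """
    return {
        name: (e_i - error_full) / e_i
        for name, e_i in error_without.items()
    }


# Hypothetical error rates for illustration only.
E = 0.02
E_without = {"text": 0.025, "app": 0.03, "web": 0.04, "location": 0.05}
contributions = relative_contribution(E, E_without)
```

A classifier whose removal leaves the error unchanged (E_i = E) has a contribution of 0, and the contribution approaches 1 as removing the classifier makes the error much larger than E.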

VI. Conclusion 

In this work, we proposed a parallel binary decision-level 
fusion architecture for classifiers based on four biometric 
modalities: text, application usage, web browsing, and location. 
Using this fusion method we addressed the problem of 
active authentication and characterized its performance on a 
real-world dataset of 200 subjects, each using their personal 
Android mobile device for a period of at least 30 days. The 
authentication system achieved an equal error rate (EER) of 
0.05 (5%) after 1 minute of user interaction with the device, 
and an EER of 0.01 (1%) after 30 minutes. We showed the 
performance of each individual classifier and its contribution 
to the fused global decision. The location-based classifier, 
while having the lowest firing rate, contributes the most to 
the performance of the fusion system. 